Editorial: Use consistent notation for code points. by rkirsling · Pull Request #1724 · tc39/ecma262

rkirsling · 2019-10-07T04:27:41Z

Re-evaluating all cases of `"foo"` throughout the spec (so as to switch to *"foo"* in one fell swoop) reveals that our notation for code ~~units~~ points isn't entirely consistent.

I'm starting with the maximal patch here for discussion's sake, but I'm happy to drop any (or all) of these changes; my only goal is to avoid blindly upgrading to asterisks without adequate consideration.

(CC: @jmdyck)

jmdyck · 2019-10-07T04:32:52Z

spec.html

          </li>
          <li>
-            It is a Syntax Error if FlagText of |RegularExpressionLiteral| contains any code points other than `"g"`, `"i"`, `"m"`, `"s"`, `"u"`, or `"y"`, or if it contains the same code point more than once.
+            It is a Syntax Error if FlagText of |RegularExpressionLiteral| contains any code points other than `g`, `i`, `m`, `s`, `u`, or `y`, or if it contains the same code point more than once.


These are code points, not code units. (So the status quo is kind of wrong as well, but this PR probably shouldn't propagate that.)

Agh, right, I am conflating the two (though in my defense, it's not just me—compare lines 25051 and 25076 here 😅). Is your vote to just worry about the code point cases and let (sequences of) code units be represented as strings?

Yup, that'd be my vote.

jmdyck · 2019-10-07T04:34:44Z

spec.html

      <h1>Pattern Semantics</h1>
      <p>A regular expression pattern is converted into an internal procedure using the process described below. An implementation is encouraged to use more efficient algorithms than the ones listed below, as long as the results are the same. The internal procedure is used as the value of a RegExp object's [[RegExpMatcher]] internal slot.</p>
-      <p>A |Pattern| is either a BMP pattern or a Unicode pattern depending upon whether or not its associated flags contain a `"u"`. A BMP pattern matches against a String interpreted as consisting of a sequence of 16-bit values that are Unicode code points in the range of the Basic Multilingual Plane. A Unicode pattern matches against a String interpreted as consisting of Unicode code points encoded using UTF-16. In the context of describing the behaviour of a BMP pattern &ldquo;character&rdquo; means a single 16-bit Unicode BMP code point. In the context of describing the behaviour of a Unicode pattern &ldquo;character&rdquo; means a UTF-16 encoded code point (<emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>). In either context, &ldquo;character value&rdquo; means the numeric value of the corresponding non-encoded code point.</p>
+      <p>A |Pattern| is either a BMP pattern or a Unicode pattern depending upon whether or not its associated flags contain a `u`. A BMP pattern matches against a String interpreted as consisting of a sequence of 16-bit values that are Unicode code points in the range of the Basic Multilingual Plane. A Unicode pattern matches against a String interpreted as consisting of Unicode code points encoded using UTF-16. In the context of describing the behaviour of a BMP pattern &ldquo;character&rdquo; means a single 16-bit Unicode BMP code point. In the context of describing the behaviour of a Unicode pattern &ldquo;character&rdquo; means a UTF-16 encoded code point (<emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>). In either context, &ldquo;character value&rdquo; means the numeric value of the corresponding non-encoded code point.</p>


This is (arguably) a code point, not a code unit.

jmdyck · 2019-10-07T05:07:06Z

To denote a code unit, we currently have a long way (e.g. [the code unit] 0x002E (FULL STOP)) and a short way (e.g. `"."`). The latter notation also denotes a string, but context allows us to infer "the sole code unit of the string ...".

This PR proposes a different short notation, i.e. the current notation minus the quotes. But that is already the notation we use for terminal symbols and ES code snippets, i.e. source text, i.e. Unicode code points. I don't think it would be a good idea to use the same notation for code units and code points.

Instead, I think it would be okay for these cases to just change the backticks to asterisks as part of the one fell swoop, with the possible exception of the two cases noted.

rkirsling · 2019-10-07T05:23:11Z

spec.html

        </emu-alg>
        <emu-note>
-          <p>The code point `"#"` is not decoded from escape sequences even though it is not a reserved URI code point.</p>
+          <p>The code point `#` is not decoded from escape sequences even though it is not a reserved URI code point.</p>


Seems like we should update code point → code unit on this line for consistency with surroundings (and justification for dropping the proposed change here).

It's a bit tricky. Although the input and output of Decode are just String values, they are different ways of encoding a ("true") URI consisting of Unicode code points. So it's somewhat ambiguous whether "decoded" in this sentence is talking about the whole effect of Decode, or a particular point within it at which an escape sequence could be (but isn't) decoded into a Unicode code point.

Given the placement of the note, it seems more plausible that it would be talking about the whole effect of Decode, rather than an internal detail, in which case changing the first occurrence of "code point" to "code unit" (as you suggest) would be appropriate. (Though rewording the sentence to not have that ambiguity would be a bigger improvement, I think.)

Alternatively, maybe it's the parallel sentence below that should change in the opposite direction? Re-reading the sentence as a whole, it seems we could argue that the note is purely about the code point #, whereas the algorithm step is building a string from code units including "#".

Alternatively, maybe it's the parallel sentence below that should change in the opposite direction?

That is, on line 25076, change "code unit" to "code point".

Re-reading the sentence as a whole, it seems we could argue that the note is purely about the code point #,

Yup, it's debatable.

whereas the algorithm step is building a string from code units including "#".

The algorithm is clearly building a string, although the phrase each code unit valid in |uriReserved| is slightly problematic, if we assume (as with the spec's other lexical grammars) that the terminal symbols are code points, not code units. In effect, the 'uri' grammar is not there to specify a syntax (nothing is parsed with |uri| as the goal symbol), but merely to define |uriReserved| and |uriUnescaped| as sets of code units. But that's mostly beside the point.

spec.html

jmdyck · 2019-10-09T12:49:41Z

The net effect of these two commits looks good to me.

jmdyck

(Okay, this time as a review:) Looks good to me.

I'd recommend squashing the two commits to reduce churn, since the second undoes most of the first.

rkirsling · 2019-10-13T15:08:14Z

Added feedback from #1733 here.

jmdyck · 2019-10-13T18:47:03Z

Yup, looks good.

rkirsling · 2019-10-14T20:56:28Z

spec.html

        <emu-alg>
          1. Let _s_ be SourceText of |LoneUnicodePropertyNameOrValue|.
-          1. If ! UnicodeMatchPropertyValue(`"General_Category"`, _s_) is identical to a List of Unicode code points that is the name of a Unicode general category or general category alias listed in the &ldquo;Property value and aliases&rdquo; column of <emu-xref href="#table-unicode-general-category-values"></emu-xref>, then
+          1. If ! UnicodeMatchPropertyValue(`General_Category`, _s_) is identical to a List of Unicode code points that is the name of a Unicode general category or general category alias listed in the &ldquo;Property value and aliases&rdquo; column of <emu-xref href="#table-unicode-general-category-values"></emu-xref>, then


Added one more case noticed in #1733.

jmdyck · 2019-10-17T18:09:16Z

(Note that the PR was merged with the original title involving "code units" rather than the revised title involving "code points", which might be a source of confusion.)

ljharb · 2019-10-17T19:13:57Z

Oops, thanks. Hopefully the linked PR will resolve any potential future confusion.

jmdyck reviewed Oct 7, 2019

View reviewed changes

ljharb added the editorial change label Oct 7, 2019

rkirsling commented Oct 7, 2019

View reviewed changes

rkirsling commented Oct 8, 2019

View reviewed changes

spec.html Outdated Show resolved Hide resolved

ljharb requested a review from a team October 8, 2019 20:52

rkirsling changed the title ~~Editorial: Use consistent notation for code units.~~ Editorial: Use consistent notation for code points. Oct 9, 2019

rkirsling mentioned this pull request Oct 10, 2019

All string literals must be code formatted tc39/ecma402#54

Closed

ljharb requested a review from jmdyck October 11, 2019 04:16

rkirsling mentioned this pull request Oct 11, 2019

Editorial: Stylize strings as values. #1733

Merged

jmdyck approved these changes Oct 11, 2019

View reviewed changes

rkirsling force-pushed the style-fixes-for-code-units branch from d149859 to 93635e4 Compare October 13, 2019 15:07

ljharb self-assigned this Oct 14, 2019

ljharb approved these changes Oct 14, 2019

View reviewed changes

rkirsling force-pushed the style-fixes-for-code-units branch from 93635e4 to ede7059 Compare October 14, 2019 20:55

rkirsling commented Oct 14, 2019

View reviewed changes

Editorial: Use consistent notation for code units (tc39#1724)

cf8607e

ljharb force-pushed the style-fixes-for-code-units branch from ede7059 to cf8607e Compare October 16, 2019 05:34

ljharb merged commit cf8607e into tc39:master Oct 16, 2019

rkirsling deleted the style-fixes-for-code-units branch October 16, 2019 07:54

Conversation

rkirsling commented Oct 7, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jmdyck Oct 7, 2019

Choose a reason for hiding this comment

Uh oh!

rkirsling Oct 7, 2019

Choose a reason for hiding this comment

Uh oh!

jmdyck Oct 7, 2019

Choose a reason for hiding this comment

Uh oh!

jmdyck Oct 7, 2019

Choose a reason for hiding this comment

Uh oh!

jmdyck commented Oct 7, 2019

Uh oh!

rkirsling Oct 7, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jmdyck Oct 7, 2019

Choose a reason for hiding this comment

Uh oh!

rkirsling Oct 8, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jmdyck Oct 9, 2019

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jmdyck commented Oct 9, 2019

Uh oh!

jmdyck left a comment

Choose a reason for hiding this comment

Uh oh!

rkirsling commented Oct 13, 2019

Uh oh!

jmdyck commented Oct 13, 2019

Uh oh!

rkirsling Oct 14, 2019

Choose a reason for hiding this comment

Uh oh!

jmdyck commented Oct 17, 2019

Uh oh!

ljharb commented Oct 17, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

rkirsling commented Oct 7, 2019 •

edited

Loading

rkirsling Oct 7, 2019 •

edited

Loading

rkirsling Oct 8, 2019 •

edited

Loading